{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Tabular Time Series Related API Demo with NOAA Weather Data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved. <br>\n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this notebook, you will learn how to use the Tabular Time Series related API to filter the data by time windows for sample data uploaded to Azure blob storage. \n",
        "\n",
        "The detailed APIs to be demoed in this script are:\n",
        "- Create Tabular Dataset instance\n",
        "- Assign timestamp column and partition timestamp column for Tabular Dataset to activate Time Series related APIs\n",
        "- Clear timestamp column and partition timestamp column\n",
        "- Filter in data before a specific time\n",
        "- Filter in data after a specific time\n",
        "- Filter in data in a specific time range\n",
        "- Filter in data for recent time range\n",
        "\n",
        "Besides above APIs, you'll also see:\n",
        "- Create and load a Workspace\n",
        "- Load weather data into Azure blob storage\n",
        "- Create and register weather data as a Tabular dataset\n",
        "- Re-load Tabular Dataset from your Workspace"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Import Dependencies\n",
        "\n",
        "If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, run the cells below to install the Azure Machine Learning Python SDK and create an Azure ML Workspace that's required for this demo."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Prepare Environment"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Print out your version of the Azure ML Python SDK. Version 1.0.60 or above is required for TabularDataset with timeseries attribute. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import azureml.core\n",
        "azureml.data.__version__"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Import Packages"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# import packages\n",
        "from datetime import datetime, timedelta\n",
        "\n",
        "from azureml.core import Dataset, Workspace"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Set up Configuraton and Create Azure ML Workspace\n",
        "\n",
        "If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) first if you haven't already to establish your connection to the Azure ML Workspace."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "dstore = ws.get_default_datastore()\n",
        "\n",
        "dset_name = 'weather-data-florida'\n",
        "\n",
        "print(ws.name, ws.resource_group, ws.location, ws.subscription_id, dstore.name, sep = '\\n')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Load Data to Blob Storage\n",
        "\n",
        "This demo uses 2019 weather data under within weather-data folder. You can replace this data with your own."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Upload data to blob storage so it can be used as a Dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "dstore.upload('weather-data', dset_name, overwrite=True, show_progress=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create & Register Tabular Dataset with time-series trait from Blob\n",
        "\n",
        "The API on Tabular datasets with time-series trait is specially designed to handle Tabular time-series data and time related operations more efficiently. By registering your time-series dataset, you are publishing your dataset to your workspace so that it is accessible to anyone with the same subscription id. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Create Tabular Dataset instance from blob storage datapath.\n",
        "\n",
        "**TIP:** you can set virtual columns in the partition_format. I.e. if you partition the weather data by state and city, the path can be '/{STATE}/{CITY}/{partition_time:yyy/MM}/data.parquet'. STATE and CITY would then appear as virtual columns in the dataset, allowing for efficient filtering by these timestamps. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "datastore_path = [(dstore, dset_name + '/*/*/data.parquet')]\n",
        "dataset        = Dataset.Tabular.from_parquet_files(path=datastore_path, partition_format = dset_name + '/{partition_time:yyyy/MM}/data.parquet')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Assign \"datetime\" column as timestamp and \"partition_time\" from folder path as partition_timestamp for Tabular Dataset to activate Time Series related APIs. The column to be assigned should be a Date type, otherwise the assigning will fail."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "tsd = dataset.with_timestamp_columns(timestamp='datetime', partition_timestamp='partition_time')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Register the dataset for easy access from anywhere in Azure ML and to keep track of versions, lineage. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# register dataset to Workspace\n",
        "registered_ds = tsd.register(ws, dset_name, create_new_version=True, description='Data for Tabular Dataset with time-series trait demo.', tags={ 'type': 'TabularDataset' })"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Reload the Dataset from Workspace"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# get dataset by dataset name\n",
        "tsd = Dataset.get_by_name(ws, name=dset_name)\n",
        "tsd.to_pandas_dataframe().head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Filter Data by Time Windows\n",
        "\n",
        "Once your data has been loaded into the notebook, you can query by time using the time_before(), time_after(), time_between(), and time_recent() functions.The filter is optimized to only load those data files within the partition_timestamp range when partition_timestamp is specified.\n",
        "\n",
        "include_boundary is default to be true for all the time series related filters, please pass include_boundary=False to exclude boundary."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Before Time Input"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# select data that occurs before a specified date\n",
        "tsd2 = tsd.time_before(datetime(2019, 6, 12))\n",
        "tsd2.to_pandas_dataframe().tail(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## After Time Input"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# select data that occurs after a specified date\n",
        "tsd2 = tsd.time_after(datetime(2019, 5, 30))\n",
        "tsd2.to_pandas_dataframe().head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Before & After Time Inputs\n",
        "\n",
        "You can chain time functions together."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# select data that occurs within a given time range\n",
        "tsd2 = tsd.time_after(datetime(2019, 1, 1)).time_before(datetime(2019, 1, 10))\n",
        "tsd2.to_pandas_dataframe().head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Time Range Input"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# another way to select data that occurs within a given time range\n",
        "tsd2 = tsd.time_between(start_time=datetime(2019, 1, 31, 23, 59, 59), end_time=datetime(2019, 2, 7))\n",
        "tsd2.to_pandas_dataframe().head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Time Recent Input"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This function takes in a datetime.timedelta and returns a dataset containing the data from datetime.now()-timedelta() to datetime.now()."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**NOTE:** This will return an empty dataframe there is no data within the last 2 days."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "tsd2 = tsd.time_recent(timedelta(days=2))\n",
        "tsd2.to_pandas_dataframe().tail(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Drop and keep columns\n",
        "\n",
        "You can also choose to drop or keep certain columns."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Drop Columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<font color=red>If a timeseries column is dropped, the corresponding capabilities will be dropped for the returned dataset.</font><br>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "tsd2 = tsd.drop_columns(columns=['snowDepth', 'version', 'datetime'])\n",
        "tsd2.take(5).to_pandas_dataframe()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The exception is expected because dataset loses timeseries capabilities to do time travel."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.exceptions import DatasetTimestampMissingError\n",
        "\n",
        "try:\n",
        "    tsd2.time_before(datetime(2019, 6, 12)).to_pandas_dataframe().tail(5)\n",
        "except DatasetTimestampMissingError as e:\n",
        "    print('Expected exception : {}'.format(str(e)))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Drop will return dataset with timeseries capabilities if modify column list to exclude timestamp columns."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "tsd2 = tsd.drop_columns(columns=['snowDepth', 'version', 'upload_date'])\n",
        "tsd2.take(5).to_pandas_dataframe()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "tsd2.time_before(datetime(2019, 6, 12)).to_pandas_dataframe().tail(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Keep Columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<font color=red>If a timeseries column is not included, the timeseries capabilities will be dropped for the returned dataset.</font><br>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "tsd2 = tsd.keep_columns(columns=['snowDepth'], validate=False)\n",
        "tsd2.to_pandas_dataframe().tail()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The exception is expected because dataset loses timeseries capabilities to do time travel."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "try:\n",
        "    tsd2.time_before(datetime(2019, 6, 12)).to_pandas_dataframe().tail(5)\n",
        "except DatasetTimestampMissingError as e:\n",
        "    print('Expected exception : {}'.format(str(e)))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Keep will return dataset with timeseries capabilities if modify column list to include timestamp columns."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "tsd2 = tsd.keep_columns(columns=['snowDepth', 'datetime', 'partition_time'], validate=False)\n",
        "tsd2.to_pandas_dataframe().tail()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "tsd2.time_before(datetime(2019, 6, 12)).to_pandas_dataframe().tail(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Resetting Timestamp Columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Rules for reseting are:\n",
        "- You cannot assign 'None' to timestamp while assign a valid column name to partition_timestamp because partition_timestamp is optional while timestamp is mandatory for Tabular time series data.\n",
        "- If you assign 'None' to timestamp, then both timestamp and partition_timestamp will all be cleared.\n",
        "- If you assign only 'None' to partition_timestamp, then only partition_timestamp will be cleared."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.exceptions import UserErrorException\n",
        "# Illegal clearing, exception is expected.\n",
        "try:\n",
        "    tsd2 = tsd.with_timestamp_columns(timestamp=None, partition_timestamp='partition_time')\n",
        "except UserErrorException as e:\n",
        "    print('Cleaning not allowed because {}'.format(str(e)))\n",
        "\n",
        "# clear both\n",
        "tsd2 = tsd.with_timestamp_columns(timestamp=None, partition_timestamp=None)\n",
        "print('after clean both with None/None, timestamp columns are: {}'.format(tsd2.timestamp_columns))\n",
        "\n",
        "# clear partition_timestamp only and assign 'datetime' as timestamp column\n",
        "tsd2 = tsd2.with_timestamp_columns(timestamp='datetime', partition_timestamp=None)\n",
        "print('after clean partition timestamp column, timestamp columns are: {}'.format(tsd2.timestamp_columns))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/work-with-data/datasets-tutorial/datasets-tutorial.png)"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "jamgan"
      }
    ],
    "category": "tutorial",
    "compute": [
      "Local"
    ],
    "datasets": [
      "NOAA"
    ],
    "deployment": [
      "None"
    ],
    "exclude_from_index": false,
    "framework": [
      "Azure ML"
    ],
    "friendly_name": "Filtering data using Tabular Timeseiries Dataset related API",
    "index_order": 1,
    "kernelspec": {
      "display_name": "Python 3.8 - AzureML",
      "language": "python",
      "name": "python38-azureml"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.10"
    },
    "notice": "Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License.",
    "star_tag": [
      "featured"
    ],
    "tags": [
      "Dataset",
      "Tabular Timeseries"
    ],
    "task": "Filtering"
  },
  "nbformat": 4,
  "nbformat_minor": 2
}